Welcome to R!

Purpose of this tutorial

R has extremely good documentation, including several free books that systematically teach the basics and then some. This tutorial is not meant to replace those resources but to provide an initial project for trying out R and RStudio. As such, it presents a streamlined process with jumping-off points along the way. If you are interested in learning more R, there is a recommended book list at the end.

RStudio Cloud

To complete this tutorial, you will need an RStudio Cloud account. Navigate to the RStudio Cloud homepage and click Get Started.

Once you’ve made your account, you will be taken to Your Workspace. Click the blue New Project button to the right of Your Projects and wait for the Deploying Project bubbles to disappear.

R Script

Under File, highlight New File and select R Script.

Go ahead and save the file. This is where we will write and run all our commands.

Set up the environment

Open Tools and click on Global Options. There will be four sections: R Sessions, Workspace, History and Other. Under Workspace, uncheck Restore .RData into workspace at startup and change Save workspace to .RData on exit to Never. Click OK to save and exit.

Changing these options makes it a little harder to pick up where you left off after closing a project, but it makes your code reproducible: every session starts from a clean workspace, so you get the same results every time you run your code.

Run some code

Speaking of running code, RStudio runs only the code you select. What does that mean? If your cursor is on a line and nothing is highlighted, that line is run. If you highlight some of your code, only the highlighted part is run.

Try it out by typing 2 + 2 on the first line of your blank R script and 4 + 4 on the next one. Click the Run button at the upper right corner of your R script and see what happens. (You can also use the keyboard shortcut Ctrl+Enter, or Cmd+Return on a Mac.)

All of the functions we will be using are well documented. If you ever have questions about one, run its name in the console (lower left pane) with a ? in front, e.g. ?help.

R and Data

Load the Tidyverse

This tutorial makes extensive use of the Tidyverse.

If you have never used the Tidyverse before, you are about to meet one of the best packages in R, or rather, one of the best groups of packages. The Tidyverse contains multiple packages designed to clean, wrangle and graph data. The packages are compatible with each other, so you can move easily through each stage of cleaning, plotting and presenting.

To install the Tidyverse, either click on the Packages tab in the lower right pane, then the Install button, and type tidyverse, or run install.packages("tidyverse") in the console. It will take a moment to run and will fill your console with red text. The installation will finish with The downloaded source packages are in ‘/tmp/RtmpPmVnsD/downloaded_packages’ (the temporary folder name will differ for you). Now we can load the package into R using library(). This code should go at the top of your R script.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts --------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Let’s add one more package, maps. Install it the same way as the Tidyverse, then load it.

library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map

We only need to install a package once. After that, we can load it whenever we need it with a library() call.
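
As a rough sketch of the install-once, load-every-session pattern, the snippet below checks which packages are missing before installing anything. The helper name missing_packages() is made up for this illustration; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: return the names of packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "maps"))
print(to_install)  # empty once both packages are installed

# Uncomment to install whatever is missing (only needed once):
# install.packages(to_install)
```

The library() calls still go at the top of the script, because loading has to happen in every new session.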

Loading data

We will be working with a semi-cleaned version of a UFO data set from Kaggle. The original data set can be found here. We will only be looking at data for the US. You can get the zipped file here: ufos.zip. RStudio Cloud unpacks zipped files when they are uploaded, so leave the file zipped.

To get the data, you will need to upload it into your environment. Under the Files tab in the lower right pane, click on the Upload button. Find the zipped file called ufos on your computer and upload it. You should see ufos.csv appear in your files.

Let’s read in the csv as a tibble, which is a slightly more flexible version of a data frame, and store it in a variable that we will use to refer to the data for the rest of the tutorial.

ufos <- as_tibble(read_csv("ufos.csv"))

You will get a warning:

Warning: 2 parsing failures.
  row                col               expected actual       file
22952 duration..seconds. no trailing characters      ` 'ufos.csv'
29401 duration..seconds. no trailing characters      ` 'ufos.csv'

You can safely ignore it: it affects only 2 of the 65,114 rows.

You should see ufos show up as a data set in your Global Environment in the upper right of your screen. If you click on it, the data will open in a new tab.
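
If you are curious what a parsing failure does to the data, here is a small sketch on a made-up two-row file (not the real ufos.csv). It assumes the readr package is installed and uses its problems() function to list the values that could not be parsed.

```r
library(readr)

# Made-up two-row file; the second duration has a stray trailing backtick.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("shape,duration", "circle,20", "light,900`"), toy_file)

# Force the duration column to be numeric, as it is in the real data.
toy <- suppressWarnings(
  read_csv(toy_file, col_types = cols(shape = col_character(),
                                      duration = col_double()))
)

problems(toy)   # one row per value that could not be parsed
toy$duration    # the unparseable value is stored as NA
```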

Look at our data

Let’s take a quick look at our data.

glimpse(ufos)
## Observations: 65,114
## Variables: 14
## $ datetime             <dttm> 1949-10-10 20:30:00, 1956-10-10 21:00:00...
## $ city                 <chr> "san marcos", "edna", "kaneohe", "bristol...
## $ state                <chr> "tx", "tx", "hi", "tn", "ct", "al", "fl",...
## $ country              <chr> "us", "us", "us", "us", "us", "us", "us",...
## $ shape                <chr> "cylinder", "circle", "light", "sphere", ...
## $ duration..seconds.   <dbl> 2700, 20, 900, 300, 1200, 180, 120, 300, ...
## $ duration..hours.min. <chr> "45 minutes", "1/2 hour", "15 minutes", "...
## $ comments             <chr> "This event took place in early fall arou...
## $ date.posted          <date> 2004-04-27, 2004-01-17, 2004-01-22, 2007...
## $ latitude             <dbl> 29.88306, 28.97833, 21.41806, 36.59500, 4...
## $ longitude            <dbl> -97.94111, -96.64583, -157.80361, -82.188...
## $ year                 <dbl> 1949, 1956, 1960, 1961, 1965, 1966, 1966,...
## $ month                <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...
## $ day                  <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1...

The glimpse() function gives us a compressed view of the data. Here are a few things we can gather from the printout.

  • We have 14 variables and 65,114 observations
  • 7 variables are related to time
  • 5 variables are related to place
  • We have 2 variables in date format, 6 in character and 6 in numerical format.
  • We have 8 continuous variables, and 6 discrete.
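
Tallies like these can also be checked in code rather than by eye. The sketch below uses a made-up three-column tibble standing in for ufos; the names toy and col_types are only for illustration.

```r
library(dplyr)

# Made-up tibble with one character, one numeric and one date column.
toy <- tibble(
  city = c("edna", "kaneohe"),
  year = c(1956, 1960),
  date.posted = as.Date(c("2004-01-17", "2004-01-22"))
)

dim(toy)  # number of rows, then number of columns

# First class of each column, then a tally of how often each class occurs.
col_types <- vapply(toy, function(col) class(col)[1], character(1))
table(col_types)
```

Running the same two steps on ufos should reproduce the counts in the list above.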

Form some questions

Now that we’ve made some observations about our data, let’s ask some questions:

  • Which state has the most sightings?
  • Has the number of sightings increased over time?
  • Which shapes are seen most commonly?
  • What is the average length of time each shape is seen?
  • Where are the UFOs being spotted?

Plotting

Plot 1: Which state has the most sightings?

Let’s start with the data. If we write the name of the variable that holds our data set, ufos, we will get all the data, which is not exactly what we’re looking for. We really want to know which state has seen the most UFOs, which means we want the total number of observations for each state.

To do this, we’re going to use the count() function. We give it a column name (or variable), and it counts up the rows (or observations) for each value in that column. This works well with a discrete variable but would not work well with a continuous variable, where almost every value is unique.

ufos %>% 
  count(state)
## # A tibble: 52 x 2
##    state     n
##    <chr> <int>
##  1 ak      319
##  2 al      642
##  3 ar      588
##  4 az     2414
##  5 ca     8912
##  6 co     1413
##  7 ct      892
##  8 dc        7
##  9 de      166
## 10 fl     3835
## # ... with 42 more rows

We are using what is called a pipe: %>%. It essentially says: run this line, then pass its result on as the first argument of the next line. So instead of writing count(x = ufos, state), we can write it with pipes. The other thing we can see is that we don’t have to name each argument in the count() function if we supply them in order, so we can also write count(ufos, state).
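
To convince yourself that the piped and unpiped forms are the same call, here is a minimal sketch on a made-up tibble (the toy data and the names piped and direct are assumptions for illustration only).

```r
library(dplyr)

# Made-up sightings; only the state column matters here.
toy <- tibble(state = c("ca", "wa", "ca", "tx", "ca", "wa"))

piped  <- toy %>% count(state)   # data passed in by the pipe
direct <- count(toy, state)      # data passed in as the first argument

identical(piped, direct)
```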

Let’s arrange the data so we can see which states have the most observations. We will use a pipe so we don’t have to save our data in a variable. Since arrange() automatically sorts from least to greatest, we will use desc() to flip this.

ufos %>% 
  count(state) %>% 
  arrange(desc(n))
## # A tibble: 52 x 2
##    state     n
##    <chr> <int>
##  1 ca     8912
##  2 wa     3966
##  3 fl     3835
##  4 tx     3447
##  5 ny     2980
##  6 il     2499
##  7 az     2414
##  8 pa     2366
##  9 oh     2275
## 10 mi     1836
## # ... with 42 more rows
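
As an aside, count() has a sort argument that does the same job as the separate arrange(desc(n)) step. The sketch below compares the two on a made-up tibble; the object names are only for illustration.

```r
library(dplyr)

# Made-up sightings: three in ca, two in wa, one in tx.
toy <- tibble(state = c("ca", "wa", "ca", "tx", "ca", "wa"))

sorted_in_count <- toy %>% count(state, sort = TRUE)
sorted_after    <- toy %>% count(state) %>% arrange(desc(n))

sorted_in_count
```

Either form is fine; the tutorial keeps arrange() because it is useful well beyond counting.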

If we want to see all the data, we can always pipe it into view(), but let’s go even further and make a graph. We are going to use ggplot2 (to learn more about it, run ?ggplot2). There are three things you need to know immediately to use ggplot2.

  1. To create a graph, you will always need to call ggplot() and then add the type of chart you want. ggplot() creates the base layer that you add plots to; the geoms are the charts you add.
  2. Each ggplot() or geom function takes two kinds of arguments: those inside aes() (the aesthetics) and those outside it. The data set and anything you want to hold constant go outside aes(), while any variables you want to vary per observation go inside aes().
  3. ggplot2 matches arguments by position, which means that if you supply them in the expected order you don’t have to name them (x = and y = are written out below, but they could be omitted).

ufos %>% 
  count(state) %>% 
  ggplot(aes(x = state, y = n)) +
  geom_bar(stat = "identity") 

  • We can pipe our data right into the plot.
  • We use + rather than %>% to add each new layer to a ggplot2 plot.
  • geom_bar() requires stat = 'identity' to use n as the height of each bar. Use ?geom_bar() to find out why.
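
A short aside on that last point: by default geom_bar() counts rows itself, which is why it needs stat = "identity" when the counts are already in n. ggplot2 also provides geom_col(), which uses the values as bar heights directly. The sketch below uses made-up counts, not the real ufos totals.

```r
library(dplyr)
library(ggplot2)

# Made-up counts standing in for the output of count(state).
toy_counts <- tibble(state = c("ca", "wa", "tx"), n = c(3, 2, 1))

p_col <- ggplot(toy_counts, aes(x = state, y = n)) +
  geom_col()  # same bars as geom_bar(stat = "identity")
p_col
```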

This chart is pretty hard to read, so let’s filter our data to look only at states with at least 1,000 observations. We can just add the filter() between the count() and the plot. Let’s also tack on a coord_flip(), which renders x values along the y axis and y values along the x axis.

ufos %>% 
  count(state) %>% 
  filter(n >= 1000) %>% 
  ggplot(aes(state, n)) +
  geom_bar(stat = "identity") +
  coord_flip()
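
One optional refinement, sketched here on made-up counts: the bars are drawn in alphabetical order by default, and base R's reorder() can sort them by n instead. The names toy_counts, ordered_states and p_sorted are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up counts standing in for the filtered state totals.
toy_counts <- tibble(state = c("tx", "ca", "wa"), n = c(1, 3, 2))

# reorder() turns state into a factor whose levels are sorted by n.
ordered_states <- reorder(toy_counts$state, toy_counts$n)
levels(ordered_states)

p_sorted <- toy_counts %>%
  ggplot(aes(reorder(state, n), n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "state")  # replace the "reorder(state, n)" axis title
p_sorted
```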

Plot 2: How have sightings changed over time?

California has by far the most sightings, so let’s see how its sightings have changed over time.

First, let’s filter by state, and then count up the number of observations by year. We can pipe that data directly into a bar chart.

ufos %>% 
  filter(state == "ca") %>% 
  count(year) %>% 
  ggplot(aes(year, n)) +
  geom_bar(stat = "identity") 

Try making some of these other plots:

Plot 3: Which shapes are seen most commonly?

Take a moment to think about how to construct this before looking at the code.

ufos %>% 
  count(shape) %>% 
  filter(n >= 1000) %>% 
  ggplot(aes(shape, n)) +
  geom_bar(stat = "identity") +
  coord_flip()

Note:

  • We added coord_flip() so that the terms would be readable.
  • Having trouble understanding how to use the functions? Use ?filter(), ?count() or ?geom_bar() to see what’s tripping you up.

Plot 4: What is the average length of time each shape is seen?

Let’s find the average length of time that each shape was seen. R has a built-in function called mean() that we can use. However, we will have to do a little more work than we did with the count() function.

What we want to do is group all the observations by their shape and then take the average duration within each group. We’re going to use group_by() to collect the data by shape. While this doesn’t appear to change the data, it allows summarize() to work on each group separately.

If we wanted to count the number of observations for each shape using this method, we would group by shape and summarize using the n() function, which counts the observations in each group. (Nothing ever goes in those parentheses.) It would look like this: ufos %>% group_by(shape) %>% summarize(obs = n()). But we are looking at the average time. This means that we want to keep grouping by shape, but instead of counting in summarize() we want to create a new column that holds the mean. The variable that holds the data we want is called duration..seconds. (the trailing period is part of the name). The observations are in seconds, so we divide by 60 to get the values in minutes.

ufos %>% 
  group_by(shape) %>% 
  summarise(average = mean(duration..seconds.)/60)
## # A tibble: 29 x 2
##    shape    average
##    <chr>      <dbl>
##  1 changed    60   
##  2 changing   35.8 
##  3 chevron     8.09
##  4 cigar      36.4 
##  5 circle     63.0 
##  6 cone       26.9 
##  7 crescent  630   
##  8 cross      12.5 
##  9 cylinder   68.9 
## 10 delta      44.7 
## # ... with 19 more rows
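
To see the grouping at work on something small, here is a sketch with a made-up tibble. It also shows one assumption-laden extra: if a group contains a missing duration (NA), mean() returns NA for that group unless na.rm = TRUE is set. The column name seconds is hypothetical.

```r
library(dplyr)

# Made-up durations in seconds; one value is missing.
toy <- tibble(
  shape   = c("circle", "circle", "light", "light", "light"),
  seconds = c(60, 180, 120, NA, 240)
)

by_shape <- toy %>%
  group_by(shape) %>%
  summarise(
    obs     = n(),                                # rows per shape, NA included
    average = mean(seconds, na.rm = TRUE) / 60    # minutes, skipping NA
  )
by_shape
```

The obs column matches what count(toy, shape) would return, which is the equivalence described above.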

Let’s make a graph. What do you think the code below is doing?

ufos %>% 
  group_by(shape) %>% 
  summarise(average = mean(duration..seconds.)/60,
            obs = n()) %>% 
  arrange(desc(average)) %>%  
  filter(average <= 20) %>%
  ggplot(aes(shape, average)) +
  geom_bar(aes(fill = obs), stat = "identity") +
  coord_flip()

We changed the fill of the bars based on the number of observations. We did this by setting fill to obs inside aes(). If we took it out of aes(), it would no longer be based on obs; instead, we could set it to a single color such as green or black. While you can use any color hex code, R also has some built-in color names. Run colors() in the console to see them.
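
The difference between a mapped fill and a constant fill can be sketched as follows. The data are made up, and "darkgreen" is just one of the built-in color names chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up averages standing in for the summarised ufos data.
toy_avg <- tibble(shape = c("circle", "light"), average = c(2, 3), obs = c(2, 3))

# Mapped: fill varies with obs, so it belongs inside aes().
p_mapped <- ggplot(toy_avg, aes(shape, average)) +
  geom_bar(aes(fill = obs), stat = "identity")

# Constant: every bar gets the same color, so fill sits outside aes().
p_fixed <- ggplot(toy_avg, aes(shape, average)) +
  geom_bar(stat = "identity", fill = "darkgreen")

"darkgreen" %in% colors()  # check that a name is one of R's built-in colors
```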

Plot 5: Where are they being seen?

The next plot is a lot more complicated. We’re going to make a map of where the most UFOs are seen. Each observation is recorded with the city where it was seen, so we can use that to count the number of observations. But if we’re actually going to plot it, we’re going to need something more.

We’re going to plot it by mapping each point via its longitude and latitude. Let’s return to our friend the count() function, but instead of counting only by city, let’s also count by each unique longitude and latitude. In theory, this will produce the same number of rows as counting by city alone, but it appears that there are multiple locations per city. We aren’t going to worry about that for now. (While it is possible to fix this issue, it is beyond the scope of this tutorial.)

ufos %>% 
  count(city, longitude, latitude)
## # A tibble: 15,035 x 4
##    city                       longitude latitude     n
##    <chr>                          <dbl>    <dbl> <int>
##  1 abbeville                      -92.1     30.0     4
##  2 abbeville                      -82.4     34.2     2
##  3 abbeville (lake secession)     -82.4     34.2     1
##  4 aberdeen                      -124.      47.0     8
##  5 aberdeen                       -98.5     45.5     2
##  6 aberdeen                       -83.8     38.7     1
##  7 aberdeen                       -79.4     35.1     1
##  8 aberdeen                       -76.2     39.5     5
##  9 abilene                        -99.7     32.4    30
## 10 abilene                        -97.2     38.9     2
## # ... with 15,025 more rows
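
The output above shows the same city name at several coordinates (aberdeen appears five times). A small sketch with made-up rows shows why counting by city and coordinates gives more rows than counting by city alone; the values are illustrative, not taken from ufos.

```r
library(dplyr)

# Made-up sightings: two places called aberdeen, one called abilene.
toy <- tibble(
  city      = c("aberdeen", "aberdeen", "aberdeen", "abilene"),
  longitude = c(-123.8, -123.8, -98.5, -99.7),
  latitude  = c(47.0, 47.0, 45.5, 32.4)
)

by_city     <- toy %>% count(city)
by_location <- toy %>% count(city, longitude, latitude)

nrow(by_city)      # 2: one row per city name
nrow(by_location)  # 3: one row per distinct location
```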

There are way too many locations for us to graph on a map. Let’s keep only the locations with at least 100 sightings and save the result as a variable, like we did with the original data set.

ufo_map <- ufos %>% 
  count(city, longitude, latitude) %>% 
  filter(n >= 100)

Now that we have the data set we want, we can use it to create a scatterplot.

ufo_map %>% 
  ggplot(aes(longitude, latitude)) +
  geom_point()

This doesn’t look that great, which is why we’re going to add some state borders using borders(), which draws its outlines from the maps package.

ufo_map %>% 
  ggplot(aes(longitude, latitude)) +
  borders("state") +
  coord_quickmap() +
  geom_point() 

We can see where our dots are now, but they don’t look very beautiful. Let’s add a bit more information: we can map the size and color of our points to the number of observations inside aes().

ufo_map %>% 
  ggplot(aes(longitude, latitude)) +
  borders("state") +
  coord_quickmap() +
  geom_point(aes(size = n, color = n))

We cleaned the map up a bit more! See if you can figure out what the extra functions are doing!

ufo_map %>% 
  ggplot(aes(longitude, latitude)) +
  borders("state") +
  coord_quickmap() +
  geom_point(aes(size = n, color = n), alpha = 0.7) +
  theme_void() +
  guides(size = "none") + 
  scale_color_gradient(low = "#20dba4", high = "#0c4655") +
  labs(
    x = "",
    y = "",
    color = "Totals"
  ) +
  theme(
    legend.box.margin = margin(0, 10, 0, 0)
  )

Appendix

Congratulations on finishing the tutorial! Now see if you can think up and answer some of your own questions!

Download R

Did you love R? You can download it onto your computer for unlimited and offline use!